Multi-Page Document VQA with Recurrent Memory Transformer


Authors

Qi Dong (Computer Vision Center)*; Lei Kang (Computer Vision Center); Dimosthenis Karatzas (Computer Vision Center)
qdong@cvc.uab.cat*; lkang@cvc.uab.es; dimos@cvc.uab.es

Abstract

Multi-page document Visual Question Answering (VQA) poses realistic challenges for document understanding because of the complexity and volume of information distributed across multiple pages. Current state-of-the-art methods often struggle with lengthy documents: when the task is treated as single-page document VQA, the input exceeds the model's token limit, while methods that compress each page into a vector may omit crucial information. To our knowledge, our proposed method is the first to integrate a recurrent memory mechanism with the transformer architecture for multi-page document VQA. Extensive experiments demonstrate that our method achieves state-of-the-art performance while maintaining a manageable model size.
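For intuition, the sketch below illustrates the recurrent-memory idea the abstract describes: a standard transformer encoder processes one page at a time, with learnable memory tokens prepended to each page's sequence and their updated states carried forward to the next page. This is a minimal illustrative sketch, not the paper's actual implementation; the class name, hyperparameters, and interface are assumptions.

```python
import torch
import torch.nn as nn

class RecurrentMemoryPageEncoder(nn.Module):
    """Illustrative sketch of a recurrent memory transformer over pages.

    A transformer encoder reads one page of token embeddings at a time.
    Learnable memory tokens are prepended to each page; their updated
    states are passed to the next page, so the memory accumulates a
    fixed-size summary of the whole document.
    """

    def __init__(self, d_model=512, n_heads=8, n_layers=4, n_mem=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Initial memory tokens, shared across documents (hypothetical choice)
        self.memory = nn.Parameter(torch.randn(1, n_mem, d_model))
        self.n_mem = n_mem

    def forward(self, pages):
        # pages: list of (batch, seq_len, d_model) token embeddings, one per page
        mem = self.memory.expand(pages[0].size(0), -1, -1)
        for page_tokens in pages:
            x = torch.cat([mem, page_tokens], dim=1)  # [memory | page tokens]
            x = self.encoder(x)
            mem = x[:, :self.n_mem]  # carry the updated memory to the next page
        return mem  # document-level summary, e.g. fed to an answer decoder

# Toy usage: a batch of 2 documents, 3 pages of 100 tokens each
enc = RecurrentMemoryPageEncoder()
pages = [torch.randn(2, 100, 512) for _ in range(3)]
doc_memory = enc(pages)  # shape: (2, 16, 512)
```

Because only the fixed-size memory crosses page boundaries, the per-step input never exceeds the encoder's token limit regardless of document length, which is the trade-off the abstract contrasts with single-page truncation and lossy page compression.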